Big Data Analysis in Finance

Module 2

Prof. Matthew G. Son

University of South Florida

Data merge/join

Data joining: keys

It is unlikely that data involves only a single data frame

  • You must join tables together

When you want to join two tables, you must find keys:

  1. Primary key: set of variable(s) that uniquely identifies each observation.
  2. Foreign key: set of variable(s) that corresponds to primary key in another table.

Example: Financials and Stock prices

Stock price data: contains daily stock prices

  • key: symbol and date
library(tidyquant)
symbols <- c("AAPL","MSFT", "TSLA", "GME")
stock_prices <- tq_get(symbols, from = "2023-01-01", to = "2023-12-31")
stock_prices |> slice_head(n = 1, by = symbol)
# A tibble: 4 × 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 AAPL   2023-01-03 130.  131.  124.  125.  112117500    124. 
2 MSFT   2023-01-03 243.  246.  237.  240.   25740000    235. 
3 TSLA   2023-01-03 118.  119.  105.  108.  231402800    108. 
4 GME    2023-01-03  18.6  19.3  17.1  17.2   5135200     17.2

Index data: contains index constituents

  • key: symbol
sp500_index <- tq_index("SP500")
sp500_index |> head(4)
# A tibble: 4 × 8
  symbol company       identifier sedol weight sector shares_held local_currency
  <chr>  <chr>         <chr>      <chr>  <dbl> <chr>        <dbl> <chr>         
1 AAPL   APPLE INC     037833100  2046… 0.0681 -        183025769 USD           
2 MSFT   MICROSOFT CO… 594918104  2588… 0.0604 -         90574052 USD           
3 NVDA   NVIDIA CORP   67066G104  2379… 0.0599 -        298381059 USD           
4 AMZN   AMAZON.COM I… 023135106  2000… 0.0384 -        114917259 USD           

Checking primary keys

How can we identify primary keys?

  • Browse the dictionary, data and see which variable overlaps

  • Make sure the key(s) uniquely identify each observations (rows) for the table

\[ \text{# of distinct values of key} == \text{N} \]

The number of distinct values of the key should equal to total number of rows.

Guide on primary key checking

Step 1. Check if missing values

If key variable(s) contain missing values, it might be from data corruption.

If it contains missing values, consider dropping missing rows, or search other potential keys.

# symbol and date are key variables
stock_prices |> 
  filter(if_any(c(symbol, date), is.na)) |> 
  nrow()
[1] 0

Step 2. Check if uniquely identify

# Check number of distinct values equals num obervations
stock_prices |>
  summarize(n_distinct(symbol,date) == n()) 
# A tibble: 1 × 1
  `n_distinct(symbol, date) == n()`
  <lgl>                            
1 TRUE                             

Exercise

Load tidyverse and tidyquant.

  1. Download stock prices and name it as stock_prices
  • Tickers: F and GME
  • Dates: Whole year 2023
  1. Download S&P 500 index holdings as sp500_holdings
  • Use tq_index("SP500")
  1. How many observations (nrows) are found in sp500_holdings?

  2. Check if there are any missing values in key variable in the sp500_holdings

  3. Does the key variable uniquely identify all observations?

Basic joins (Equi-joins)

Basic joins are when both primary and foreign keys exactly match.

  1. Mutating joins:

    • Primary purpose is to add more columns

    • left_join(), right_join(), inner_join(), full_join()

  2. Filtering joins:

    • Primary purpose is to filter rows

    • semi_join(), anti_join()

Join diagrams

We have to determine what happens when keys are NOT MATCHED

Left Join

When non-matching, NA fill (left alive) or drop (right die).

Right Join

When non-matching, drop (left die), NA fill (right alive).

Mutating joins (Left Join)

left_data |> 
  left_join(
    right_data,
    by = join_by(left_key == right_key)
  )
  • The information from the left data.frame is not lost
  • Most intuitive and popular join

Left join sp500_index to stock_prices

stock_prices_on_left <- stock_prices |> # stock_prices on the left
  left_join(sp500_index, by = join_by(symbol))
stock_prices_on_left |> head(3)
# A tibble: 3 × 15
  symbol date        open  high   low close   volume adjusted company identifier
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl> <chr>   <chr>     
1 AAPL   2023-01-03  130.  131.  124.  125.   1.12e8     124. APPLE … 037833100 
2 AAPL   2023-01-04  127.  129.  125.  126.   8.91e7     125. APPLE … 037833100 
3 AAPL   2023-01-05  127.  128.  125.  125.   8.10e7     124. APPLE … 037833100 
# ℹ 5 more variables: sedol <chr>, weight <dbl>, sector <chr>,
#   shares_held <dbl>, local_currency <chr>

Conversely, left join stock_prices to sp500_index

sp500_index_on_left <- sp500_index |> # sp500_index on the left
  left_join(stock_prices, by = join_by(symbol))
sp500_index_on_left |> head(3)
# A tibble: 3 × 15
  symbol company   identifier sedol   weight sector shares_held local_currency
  <chr>  <chr>     <chr>      <chr>    <dbl> <chr>        <dbl> <chr>         
1 AAPL   APPLE INC 037833100  2046251 0.0681 -        183025769 USD           
2 AAPL   APPLE INC 037833100  2046251 0.0681 -        183025769 USD           
3 AAPL   APPLE INC 037833100  2046251 0.0681 -        183025769 USD           
# ℹ 7 more variables: date <date>, open <dbl>, high <dbl>, low <dbl>,
#   close <dbl>, volume <dbl>, adjusted <dbl>

Which data is considered left makes difference!

dim(stock_prices_on_left)
[1] 1000   15
dim(sp500_index_on_left)
[1] 1251   15

Exercise (Homework)

Refer to below sample code & syntax:

Load tidyverse and tidyquant.

  1. Download stock prices and name it as stock_prices
  • Tickers: F and GME
  • Dates: Whole year 2023
  1. Download S&P 500 index holdings as sp500_holdings
  • Use tq_index("SP500")

Exercise (continued)

  1. Left join sp500_holdings to stock_prices. How many rows does it have?
  2. Left join stock_prices to sp500_holdings. How many rows?
  3. Right join sp500_holdings to stock_prices. How many rows does it have?
  4. Right join stock_prices to sp500_holdings. How many rows?
  5. Inner join sp500_holdings to stock_prices. How many rows does it have?

7a. Inner join stock_prices to sp500_holdings. What’s the difference?

  1. Outer join sp500_holdings to stock_prices. How many rows does it have?

8a. Outer join stock_prices to sp500_holdings. How many rows?

Filtering joins:

When you want to filter rows as the result of join. No new columns generated.

semi_join(): Keep all rows in left frame that have match in right frame

stock_prices |> 
  semi_join(sp500_index, by = join_by(symbol)) |> 
  slice_head(n = 2, by = symbol)
# A tibble: 6 × 8
  symbol date        open  high   low close    volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
1 AAPL   2023-01-03  130.  131.  124.  125. 112117500     124.
2 AAPL   2023-01-04  127.  129.  125.  126.  89113600     125.
3 MSFT   2023-01-03  243.  246.  237.  240.  25740000     235.
4 MSFT   2023-01-04  232.  233.  226.  229.  50623400     225.
5 TSLA   2023-01-03  118.  119.  105.  108. 231402800     108.
6 TSLA   2023-01-04  109.  115.  108.  114. 180389000     114.

anti_join(): Keep all rows that that is not found in the right frame.

stock_prices |> 
  anti_join(sp500_index, by = join_by(symbol)) |> 
  slice_head(n = 2, by = symbol)
# A tibble: 2 × 8
  symbol date        open  high   low close  volume adjusted
  <chr>  <date>     <dbl> <dbl> <dbl> <dbl>   <dbl>    <dbl>
1 GME    2023-01-03  18.6  19.3  17.1  17.2 5135200     17.2
2 GME    2023-01-04  17.2  17.9  16.9  17.3 3939300     17.3

Exercise

Type the code below and see what’s happening:

stock_prices |> 
  semi_join(sp500_index, by = join_by(symbol)) |> 
  slice_head(n = 2, by = symbol)

stock_prices |> 
  anti_join(sp500_index, by = join_by(symbol)) |> 
  slice_head(n = 2, by = symbol)

Advanced joins 1

1:M (One to Many) matching

When foreign key is NOT uniquely identifying each observations:

Inner join with duplicate key

A copy of row is generated (i.e. broadcasted) to guarantee matching.

Important

What would happen if you perform left / right / inner / outer (full) join when you have 1:M relationship?

Think about in terms of number of rows you will eventually have.

Advanced joins 2

Non-equi joins:

when join includes inequality (e.g. key1 >= key2)

  • Cross joins : match every pair, Cartesian product (M:M matching)

  • Inequality joins: uses <, >, >=, <= instead of ==

  • Rolling joins: inequality with only closest matching (time series)

  • Overlap joins: join when within a range

Non-equi join example

Rolling joins

Suppose you have data in two different time frequency.

Stock trading data in minute-level frequency, and risk-free rate in daily (end-of-day) frequency.

How can you match the most recent risk-free rate to each stock trading data?

Non-equi join example 2

Overlap joins

Suppose a specific company was merged and its ticker symbol changed from “AABB” to “AAC”, then to “ABC”.

Some other company started to use “AABB” as its ticker.

In this case, when joining, you want to match ticker AND with specific date that is between dates when ticker symbol was active for the specific company.

SQL

SQL

Structured Query Language

SQL is a query language for storing and processing information in relational database.

It is standardized and widely used database language.

SQL basic verbs

Like dplyr, SQL has its own (but archaic) syntax:

  • SELECT: choose variables for the table

  • GROUP BY: group rows of data by grouper variable

  • ORDER BY: arrange the rows by the variable

  • WHERE: filter rows with given condition

  • HAVING: Similar to where but used with group by to filter data

  • JOIN: join tables, defaults INNER

  • LIMIT: show the head of table

  • WITH: temporary table for subquery

SQL and dplyr

SELECT, WHERE

WHERE is filtering function:

SELECT symbol, date, adjusted
FROM stock_prices
WHERE symbol = 'AAPL';

Is equivalent to

stock_prices |> 
  filter(symbol == "AAPL") |> 
  select(symbol, date, adjusted)

GROUP BY + Mutate

SELECT symbol, AVG(adjusted) AS avg_price, COUNT(*) AS num_days
FROM stock_prices
GROUP BY symbol;

is equivalent to

stock_prices |> 
  group_by(symbol) |> 
  summarize(avg_price = mean(adjusted), num_days = n())

HAVING

HAVING is a another filtering that is evaluated after the GROUP BY operation.

SELECT 
  symbol, 
  AVG(adjusted) AS avg_price, 
  SUM(volume) AS total_volume
FROM stock_prices
WHERE adjusted > 10
GROUP BY symbol
HAVING AVG(adjusted) > 50;
stock_prices |>
  filter(adjusted > 10) |>
  group_by(symbol) |>
  summarize(
    avg_price = mean(adjusted),
    total_volume = sum(volume)
  ) |>
  filter(avg_price > 50)

JOIN

JOIN is INNER JOIN by default. Use LEFT JOIN for left join, etc.

SELECT sp.symbol, sp.date, sp.adjusted, si.sector
FROM stock_prices sp -- alias sp
JOIN sp500_index si ON sp.symbol = si.symbol;

Equivalent to

stock_prices |> 
  inner_join(sp500_index, by = join_by(symbol)) |> 
  select(symbol, date, adjusted, sector)

Complex Query

Subqueries in SQL are written with “WITH”

WITH high_weight_symbols AS (
  SELECT symbol
  FROM sp500_index
  WHERE weight > 0.03
) 
SELECT 
  symbol, 
  AVG(adjusted) AS avg_price, 
  SUM(volume) AS total_volume
FROM stock_prices
WHERE symbol IN (SELECT symbol FROM high_weight_symbols)
GROUP BY symbol
HAVING AVG(adjusted) > 50;

In dplyr, store as interm object:

high_weight_symbols <- sp500_index |>
  filter(weight > 0.03) |>
  pull(symbol)

stock_prices |>
  filter(symbol %in% high_weight_symbols) |>
  group_by(symbol) |>
  summarize(
    avg_price = mean(adjusted),
    total_volume = sum(volume)
  ) |>
  filter(avg_price > 50)

Notes on SQL

SQL is (somehow) easier to read, but it does NOT guarantee that the execution is in the order of writing.

  • The query usually starts with SELECT, but it is executed later.

  • Confusingly, you have to write things in pre-defined order (SELECT then FROM then WHERE etc…)

  • Also confusingly, some verbs have different nuances but does similar job (and vice versa)

    • WHERE and HAVING

    • GROUP BY and PARTITION BY

    • etc.

Notes on SQL

Source: Julia Evans, “Become a Select STAR”

Source: Julia Evans, “Become a Select STAR”

Database in R

Prep

Connections pane in IDEs is helpful to understand and browse database. To enable, first install:

install.packages("duckdb") # Our in-memory DB engine
install.packages("remotes") # Need remotes package for github pacakge installation
remotes::install_github("rstudio/connections") # Install package from github

Open and Close connections

You can make a databse connection to anywhere:

  • Local memory
  • Local disk
  • Remote databse
library(connections)
# by default, it is in-memory connection
con <- connection_open(db_engine_name) 

When finished, close connection with:

connection_close(con)

dbplyr: dplyr to SQL Databases

Tip

You can still use dplyr syntax to perform query on SQL databases, by using dplyr verbs on dbplyr object.

Package dbplyr converts dplyr to SQL code.

  • dbplyr is called backend of dplyr

Database Simulation

Creating in-memory database

Here I use duckdb engine with dbplyr.

Copy dataframe (in R memory) to database (in memory).

library(connections)
library(dbplyr)
# establish database connection (in-memory)
con <- connection_open(duckdb::duckdb()) 

# Copying data to database connection
copy_to(con, stock_prices, name = "stock_price_table")
copy_to(con, sp500_index, name = "sp500_index_table")

Example 1

Running SQL query on the database

tbl(con, 
  sql("SELECT symbol, date, adjusted
    FROM stock_price_table
    WHERE symbol = 'AAPL' 
    LIMIT 5"))
# Source:   SQL [?? x 3]
# Database: DuckDB v1.1.3 [root@Darwin 24.3.0:R 4.4.1/:memory:]
  symbol date       adjusted
  <chr>  <date>        <dbl>
1 AAPL   2023-01-03     124.
2 AAPL   2023-01-04     125.
3 AAPL   2023-01-05     124.
4 AAPL   2023-01-06     128.
5 AAPL   2023-01-09     129.

To use dbplyr, we need to assign R symbol to the database table.

# Assign an R object name to use dbplyr
stock_price_db <- tbl(con, "stock_price_table")
sp500_index_db <- tbl(con, "sp500_index_table")

Now dplyr syntax works on the database:

stock_price_db |> 
  filter(symbol == "AAPL") |> 
  select(symbol, date, adjusted) |> 
  head(5)

You can see how translation is working with show_query()

stock_price_db |> 
  filter(symbol == "AAPL") |> 
  select(symbol, date, adjusted) |> 
  head(5) |> 
  show_query()

Exercise

  1. Write an SQL query that selects symbol, date, adjusted, and volume from the stock_prices table where volume is less than 50,000,000 and adjusted is greater than 45.

  2. Translate the SQL query into dplyr code using filter() and select()

  3. Construct an SQL query that groups the stock_prices table by symbol and calculates the maximum and minimum values of adjusted (naming them max_price and min_price) as well as the count of trading days (as num_days).

  4. Rewrite above query into dplyr, and confirm the results.

Example 2

Complex query example:

tbl(con, sql("
WITH high_weight_symbols AS (
  SELECT symbol
  FROM sp500_index_table
  WHERE weight > 0.03
) 
SELECT 
  symbol, 
  AVG(adjusted) AS avg_price, 
  SUM(volume) AS total_volume
FROM stock_price_table
WHERE symbol IN (SELECT symbol FROM high_weight_symbols)
GROUP BY symbol
HAVING AVG(adjusted) > 50
"))
# Source:   SQL [?? x 3]
# Database: DuckDB v1.1.3 [root@Darwin 24.3.0:R 4.4.1/:memory:]
  symbol avg_price total_volume
  <chr>      <dbl>        <dbl>
1 MSFT        310.   6918889900
2 AAPL        171.  14804257200

With dbplyr:

subqury <- sp500_index_db |>
      filter(weight > 0.03) |>
      pull(symbol)

stock_price_db |>
  filter(symbol %in% subqury) |>
  group_by(symbol) |>
  summarize(
    avg_price = mean(adjusted),
    total_volume = sum(volume)
  ) |>
  filter(avg_price > 50)
# Source:   SQL [?? x 3]
# Database: DuckDB v1.1.3 [root@Darwin 24.3.0:R 4.4.1/:memory:]
  symbol avg_price total_volume
  <chr>      <dbl>        <dbl>
1 MSFT        310.   6918889900
2 AAPL        171.  14804257200

Exercise (Challenge)

  1. Write an SQL query that

    • A subquery named as active_stocks to extract symbols from sp500_index where weight is above 0.025.
    • Query stock_prices to calculate the average adjusted price (as avg_price) and total volume (as total_volume) and number of days (as trading_days) for these symbols,
    • grouping by symbol and show only those symbols with at least 30 trading days (HAVING trading_days >= 30)
  2. Convert into dplyr.

Database Simulation 2

Creating a local (disk) database

Create on-disk database, with explicit directory address:

# Establish a disk-based connection 
con_disk <- connection_open(duckdb::duckdb(), dbdir = "disk_db.duckdb") 

# Copying data to database connection
copy_to(con_disk, stock_prices, name = "stock_price_table_disk", temporary = FALSE)
copy_to(con_disk, sp500_index, name = "sp500_index_table_disk", temporary = FALSE)
# Assign an R object name to use dbplyr
stock_price_db_disk <- tbl(con_disk, "stock_price_table") 
sp500_index_db_disk <- tbl(con_disk, "sp500_index_table")

The remaining workflow will be the same as shown in in-memory database.

Caveats: On-Disk Databases

On-Disk database can handle datasets larger than the system’s memory capacity.

However, overall performance still depends on the system’s memory and disk performance.

  • Processing Considerations:
    • Memory Usage: Complex operations (e.g., joins, sorts) may still be memory intensive.
    • Disk I/O: Query performance can be influenced by disk speed.

Caveats: dbplyr and SQL

Complex dplyr syntax may not be fully translate-able to SQL and vice versa. If you use database often, it is worthwhile to learn.

See “The dynamic duo: SQL & R”